Bayesian neural networks for optimization and control

ABSTRACT

An optimization system is provided utilizing a Bayesian neural network calculation of a derivative wherein an output is optimized with respect to an input utilizing a stochastical method that averages over many regression models. This is done such that constraints from first principal models are incorporated in terms of prior art distributions.

CROSS REFERENCE TO THE RELATED APPLICATION

[0001] The present application is a Continuation Application ofapplication Ser. No. 09/290,791, filed Oct. 6, 1998, entitled: BAYESIANNEURAL NETWORK FOR OPTIMIZATION, which is a Continuation-in-Part of, andclaims priority in, U.S. Provisional Patent Application Serial No.60/103,269, entitled Bayesian Neural Networks For Optimization andControl, and filed Oct. 6, 1998 (Attorney Docket No. PAVI-24,473).

TECHNICAL FIELD OF THE INVENTION

[0002] The present invention pertains in general to neural networks foruse with optimization of plants and, more particularly, to the use ofBayesian-trained neural networks for optimization and control.

BACKGROUND OF THE INVENTION

[0003] In general, modeling techniques for a plant involve thegeneration of some type of model. This is typically done utilizing asingle model, either a linear model or a non-linear model. However,another technique of generating a model is to utilize a plurality ofmodels that can be utilized to define a predicted vector output y(t) ofvalues y₁(t), y₂(t), . . . , y_(q)(t) as a function of an input vectorx(t) of values x₁(t), x₂(t), . . . , x_(p)(t). For the purposes of thisapplication, a vector in the text shall be defined in bold and inequation form shall be defined with an overstrike arrow.

[0004] Given a set n of measured process data points:

D=({right arrow over (x)}₁, {right arrow over (y)}₁)=({right arrow over(x)}⁽¹⁾, {right arrow over (y)}⁽¹⁾)), ({right arrow over (x)}⁽²⁾, {rightarrow over (y)}⁽²⁾), . . . , ({right arrow over (x)}^((n)), {right arrowover (y)}^((n)))  (0001)

[0005] and assuming that an underlying mapping exists with the followingrelationship:

{right arrow over (y)}=F({right arrow over (x)})  (0002)

[0006] exists, a stochastical method for generating y(t) with respect tox(t) can be defined by averaging over many (non-linear) regressionmodels F^((w)). Given x(t), Fx(t) is approximated via a stochasticneural network training algorithm (non-linear regression) to the set offunctions F^((w))x(t), with “w” being the index for the number ofmodels, by fitting to the dataset (x(t), y(t)) in the dataset D.However, this only provides a forward predictive model and does notfacilitate the use thereof for optimization or control purposes.

SUMMARY OF THE INVENTION

[0007] The present invention disclosed and claimed herein comprises amethod for optimizing a system in which a plant is provided foroptimization. A training network having an input layer for receivinginputs to the plant, an output layer for outputting predicted outputs,and a hidden layer for storing a learned representation of the plant formapping the input layer to the output layer is also provided. A methodfor training the neural network in utilizing the stochastical method ofa Bayesian-type is provided.

[0008] In another aspect of the present invention, a method utilizingthe network in an optimization mode in feedback from the output of theplant to the input of the plant to optimize the output with respect tothe input via the stochastical Bayesian method is provided.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] For a more complete understanding of the present invention andthe advantages thereof, reference is now made to the followingdescription taken in conjunction with the accompanying Drawings inwhich:

[0010]FIG. 1 illustrates a block diagram of the present inventionutilizing the optimizer of the disclosed embodiment;

[0011]FIG. 2 illustrates a diagrammatic view of the optimizer of FIG. 1;

[0012]FIG. 3 a block diagram of the combination of the models utilizinga weighted average;

[0013]FIG. 4 illustrates a diagram depicting the training operation forthe network;

[0014]FIG. 5 illustrates a process flow for the training operation ofthe multiple models;

[0015]FIG. 6 illustrates a block diagram of the optimizer wherein asingle optimized value is determined averaged over all of the models;

[0016]FIG. 7 illustrates a block diagram depicting the optimizer whereineach model is optimized and then the optimized values averaged;

[0017]FIG. 8 illustrates a diagram for projecting a prediction over ahorizon for a dynamic model in accordance with the disclosed embodiment;

[0018]FIG. 8a illustrates a block diagram of a simplified embodiment ofFIG. 8;

[0019]FIG. 8b illustrates a distribution plot of the estimated x(t)values;

[0020]FIG. 9 illustrates a diagrammatic view of the optimization processfor control;

[0021]FIG. 10 illustrates a block diagram of the plant utilizing amultiple model feedback control network for predicting a trajectory overthe control horizon;

[0022]FIGS. 11 and 12 illustrate block diagrams of an the implementationof the network for a dynamic model;

[0023]FIG. 13 illustrates a block diagram of the dynamic modelcorresponding to FIG. 8;

[0024]FIG. 14 illustrates a block diagram of the dynamic model utilizinga steady state model to fix the gain; and

[0025]FIG. 15 illustrates an alternate embodiment of the dynamic modelof FIG. 14.

DETAILED DISCLOSURE OF THE INVENTION

[0026] Referring now to FIG. 1, there is illustrated a block diagram ofthe system of the disclosed embodiment for optimizing/controlling theoperation of a plant. A plant 10 is illustrated having a vector inputx(t) and a vector output y(t). The plant 10 is operable to receive theinputs x(t), perform the process and generate the outputs y(t). Theseinputs x(t) are variables that can be manipulated. In addition to theseinputs, there are also inputs which cannot be measured and also someinputs that cannot be manipulated. These are not shown. From thestandpoint of the outputs y(t), there are a number of types ofoutputs—there are measurable outputs and immeasurable outputs. Ingeneral, the output y(t) illustrated on an output 12 comprises all ofthe outputs, both measurable and immeasurable.

[0027] The values for operating the plant in the form of the x(t)variables are generated by a controller 14. The controller 14 generatesthese values in response to information received from an optimizer 16.The optimizer 16 is operable to receive all or select ones of theoutputs from the plant 10 and is also operable to receive the currentinputs x(t). Although not shown, the optimizer 16 can also receive suchthings as immeasurable inputs and inputs that cannot be manipulated.

[0028] The optimizer 16 is operable to provide an optimized set of inputvariables {circumflex over (x)}(t) input to the controller 14. Thisoptimizer 16 operates to generate optimized values by utilizing a set ofconstraints and also some type of optimization objective. Theoptimization objective defines the desired result of the optimizationoperation. For example, it may be that certain economic costs need to beconsidered when optimizing the operation of the plant. Constraints arebasically the restrictions placed upon the optimization process, such asgain, maximum range of values for the inputs x(t), etc., which will bedescribed in more detail hereinbelow.

[0029] Referring now to FIG. 2, there is illustrated a block diagram ofthe optimizer 16. The optimizer 16 includes a plurality of models 200each operable to map the values x(t) through a representation of theplant 10 to provide a predicted value y^((w))(t) on the output where “w”is the index of the model. The x(t) variables or values are mappedthrough the representation as follows:

{right arrow over (y)}(t)=F ^((w))({right arrow over (x)}(t))  (0003)

[0030] This will provide a plurality of the models 200, each in generalbeing different. As will be described hereinbelow, each of these modelsis trained to provide a stochastical method for optimization of theoutput with respect to the input by averaging over many (non-linear orlinear) progression models 200. Each of the models 200 is related to theother of the models 200 by a stochastical relationship. During training,as will be described hereinbelow, the models 200 are related to eachother such that each of the models have parameters that arestochastically related to each other, the models defined by therelationship:

{right arrow over (y)}=F ^((w))({right arrow over (x)})  (0004)

[0031] There are “w” of these models, such that there are alsoy^((w))(t) outputs provided, one for each of the models F^((w))x(t). Theindex “w” indicates these stochastically related models. Theoptimization operation utilizes these models in conjunction with theoptimization objectives and constraints to generate an optimized value{circumflex over (x)}(t) which is averaged over all of the models 200.

[0032] The method of optimizing an output with respect to inputsdescribed hereinabove, with the option of being subject to constraintsfrom first principles models, provides some advantages over the standardneural network methods primarily by giving rise to high qualitysolutions in the system identification phase in a parameter insensitiveway that avoids overfitting. Furthermore, by having a clean statisticalinterpretation, the approach easily lends itself to estimatingconfidence levels and related quantities.

[0033] The prediction operation will be described for the stochasticmethod in a more detailed manner in the following. The data is containedin a dataset D with an index n representing the portion thereof that isassociated with training. Indices exceeding n (n+1, n+2, . . . ) refersto data not included in the training process, this being the testingdata, and no index refers to an arbitrary data point. Subscripted valuesx_(p) and y_(q) refer to an arbitrary component of an x(t) or y(t)vector, respectively. Also in the formalism herein, it will be assumedthat the model outputs y(t) are subject to Gaussian noise. This choiceis just for pedagogical reasons—the method is valid for anydistribution, and it should be understood that other distributions couldexist.

[0034] In the first step, it is necessary to predict y^((n+k))(t)(k≧1),given the measured data D and a set of model functions F^((w))x(t) eachdescribed by a set of parameters ω (e.g. neural network weights), with ωbeing a vector value of weights with values ω₁, ω₂, . . . , ω_(r), “r”being the number of weights in the model. The average predicted outputy^((n+k))(t) is then given by: $\begin{matrix}{{\langle{\overset{}{y}}^{({n + k})}\rangle} \propto {\sum\limits_{w - 1}^{N_{w}}\quad {{F^{(w)}\left( {\overset{}{x}}^{({n + k})} \right)}{\prod\limits_{i = 1}^{n}\quad {P\left( {\overset{}{\omega}\left. D \right)} \right.}}}}} & (0005)\end{matrix}$

[0035] where P(ω|D) is the conditional probability (called theposterior) for the model F^((w))x(t) with weights ω given the dataset D.Using the well-known Bayesian relation and the fact that P(D) isconstant, the following relationship is obtained: $\begin{matrix}{{\langle{\overset{}{y}}^{({n + k})}\rangle} \propto {\sum\limits_{w = 1}^{N_{w}}\quad {{F^{(w)}\left( {\overset{}{x}}^{({n + k})} \right)}{\prod\limits_{i = 1}^{n}\quad {P\left( {y^{(i)}\left. {x^{(i)},\overset{}{\omega}} \right){P\left( \overset{}{\omega} \right)}} \right.}}}}} & (0006)\end{matrix}$

[0036] where$\prod\limits_{i = 1}^{n}\quad {P\left( {y^{(i)}\left. {x^{(i)},\overset{}{\omega}} \right)} \right.}$

[0037] is the likelihood, P(ω) is a prior distribution of the parametersor model weights ω, and their product is the posterior distribution.Assuming (not necessary) also a Gaussian distribution for the likelihooddistribution of the weights of the model, the average predicted outputrelationship is as follows: $\begin{matrix}{{{\langle{\overset{}{y}}^{({n + k})}\rangle} \propto {\sum\limits_{w = 1}^{N_{w}}\quad {{F^{(w)}\left( {\overset{}{x}}^{({n + k})} \right)}^{- {E{({\overset{}{\omega},\alpha,\beta,D})}}}}}}\text{where:}} & (0007) \\{{E\left( {\overset{}{\omega},\alpha,\beta,D} \right)} = {{{\frac{1}{2}{\sum\limits_{i = 1}^{n}\left( {{\overset{}{y}}^{(i)} - {F^{(w)}\left( {\overset{}{x}}^{(i)} \right)}} \right)^{2}}} + {\frac{\alpha}{2}{\sum\limits_{j = 1}^{N_{(w)}}\left( {\overset{}{\omega}}_{j} \right)^{2}}} + {\frac{\beta}{2}{H\left( {\overset{}{\omega},D} \right)}}} = {{- \log}\quad ({posterior})}}} & (0008)\end{matrix}$

[0038] the first term representing the summed square error over thedataset D with n being the number of patterns, and the second termcorresponding to the prior penalizes large weights (a regularizer). Thethird term, also part of the prior, is written as a generic constraintthat could include, for instance, fitting on different levels to firstprinciples knowledge. The value of i ranges over the dataset D, with nbeing the number of patterns.

[0039] Referring now to FIG. 3, there is illustrated a block diagram ofthe average predicted output y^((n+k))(t). Each of the models 200 hasthe output thereof input to a weighting block 302 which applies theweighting function e^(−E(ω, α, β, D)) to the output of each of themodels 200. The output of each of the weighting blocks 302 is then inputto a summing block 304 to provide the weighted average y^((n+k)).

[0040] In the situation wherein the models are utilized in a feedbackmode, i.e., for the purpose of predicting input values, and whichfeedback is utilized for control, the gain is an important factor.Therefore, during the training of the model, it is necessary to takeinto account the gain as one constraint on the training. This isreflected in the term H(ω, D) which, when gain constraints areconsidered, results in the following relationship: $\begin{matrix}{{H\left( {\overset{}{\omega},D} \right)} = {\sum\limits_{p,q}{\sum\limits_{i}{f\left( \frac{y_{q}^{(w)}}{x_{p}} \right)}}}} & (0009)\end{matrix}$

[0041] where f( ) measures whether the argument satisfies the knownconstraint, and the index i in the sum indicates the x^(i)(t) vectorpoint at which the derivative is evaluated. The adjustable parameterfamilies α and β are drawn from fixed prior distributions described byα₀ and β₀, and also it is noted that the derivative(dy_(q)(t)^((w))/dx_(p)(t)) is the derivative for the model F^((w))x(t)(in Equation (5)) summed over the (q,p) matrix of inputs/outputs. Thisprovides the (q,p) matrix of gains for each model. It is noted thatminimizing the error value E corresponds to standard neural networklearning that would give rise to one solution (or network, or model)ω=ω*, these being the weights necessary to minimize the error.

[0042] The models F^((w))x(t) are generated for estimating<y^((n+k))(t)> (in Equation (7)) by making moves in the high dimensionalparameter space (ω, α, β). Since the distribution overF^((w))x^((n+k))(t) is averaged over a strongly peaked distribution,this is typically done using importance sampling Monte Carlo algorithms,such as the Langevin method, or type Metropolis with Hybrid Monte Carlo,and/or tempering extensions to ensure that the entire search space isproperly probed. (A wide variety of other methods are of course possibleand these noted methods are not considered exhaustive.) The error E forthis model is monitored as the parameter updates proceed. Initially, itslong-time average drops and then it flattens off to a “thermalized”static behavior, from which configurations are used to compute<y^((n+k))> in Equation (7). In other words, the summation over N_(w) inEquation (7) is restricted to an interval [N_(min), N_(max)], whereN_(min) (w=1) is given by the onset of static behavior, and N_(max) isset by the required accuracy of <y^((n+k))>, and in the case of processoptimization, which is described hereinbelow, by the availablecomputational time. This provides an ensemble of N_(w)=N_(max)−N_(min)models F^((w))x(t) trained to be valid for each of the n data recordsover the dataset D. In the above, this ensemble was used for predictingthe outputs y^((n+k))(t) when only the corresponding inputs x^((n+k))(t)are available (Equation (7)).

[0043] Referring now to FIG. 4, there is illustrated a diagrammatic viewfor the “thermalized” static behavior utilizing the training operationand, also referring to FIG. 5, there is illustrated a diagrammatic flowfor the training operation. Both of FIGS. 4 and 5 will be referred to.With specific reference to FIG. 5, the training operation is initiatedby a set of weights ω₁, represented by a block 502. These weights areutilized to generate a first model F⁽¹⁾x(t), represented by a block 504.The model is then tested in a test block 506 by utilizing the test datain the portion of the dataset n+k, the data above the training data n.This generates a first error value E₁. This data is then utilized togenerate the next set of weights ω₂. The above-noted directed randomsearch techniques, such as the Langevin method, are utilized to generatethese new weights by utilizing the weights from the previous model inthe progression as the starting point. Therefore, they will have therelationship ω₂=f(D, ω₁). This is represented in a block 508. Theseweights are utilized to train a second model F⁽²⁾x(t), as represented inblock 510, this representing a progression model. This is a model thathas a stochastic relationship to the first model in block 504. Thismodel is subsequently tested in a block 512 to determine the new errorE₂ for that model. A new set of weights ω₃ is generated by thestochastic method, as represented by a block 514. This will be utilizedto generate a third model F⁽³⁾x(t), as represented by a block 516. Thiswill also be tested in a block 518 to generate an error value E₃.

[0044] The steps noted hereinabove, between blocks 502 and 518, willcontinue for a number of progressive models, resulting initially in thegeneration of errors that represent the un-thermalized behavior in aportion of 402 of the diagrammatic view of FIG. 4. This will continueuntil the error is reduced. At this time, a model N_(min) will begenerated with weights ω_(min), as represented by a block 520. This willbe represented by the relationship ω_(min)=f(D, ω_(min−1)) asrepresented by the block 520. This will be utilized to generated themodel F^((min))x(t), which is then tested at a block 524 to generate theerror value E_(min). This represents the first model for the valueN_(min) such that for the next N_(w−1) models up to the value ofN_(max), there will be N_(w) models. The model associated with N_(max)is represented in block 526, which model is then tested in block 528 togenerate the error value E_(max). In general, as represented by thediagram in the diagrammatic view of FIG. 4, there is a “sliding” groupof N_(w) adjacent models maintained. Depending upon the processingcapabilities of the system, this number can be any range. In thedisclosed embodiment, as an example only, five hundred models areutilized. This results in the earliest model over five hundred beingdiscarded such that only five hundred models are maintained. Thetraining continues until the model in block 522 and the model in block562 are both within the region represented by reference numeral 404 inFIG. 4. Therefore, all of the models will be in the “thermalized”region.

[0045] In addition to being able to compute y(t)=F^((w))x(t) from x(t)for each model in the ensemble (from which the final average <y(t)> iscomputed), another essential computation for most process applicationsis to be able to compute the (q, p) matrix of derivatives ∂y_(q)^((w))(t)/∂x_(p)(t) for each model F^((w))x(t) in the ensemble, at anyvector point x(t), the (q,p) matrix representing the matrix ofderivatives indexed by the inputs and outputs for amulti-input/multi-output system. These derivatives are necessary for atleast two fundamental purposes:

[0046] (1) sensitivity analysis, and

[0047] (2) optimization and/or control

[0048] Each of the N_(w) ensembles of F^((w)) models is a continuousfunction; therefore each derivative ∂y_(q) ^((w))(t)/∂x_(p)(t) may beevaluated at any vector point x(t) by the elementary rules ofcalculating derivatives.

[0049] The average derivatives, weighted over the ensemble of models,can then be calculated by the following relationship: $\begin{matrix}{{\frac{\partial{\langle{y_{q}(t)}\rangle}}{\partial{x_{p}(t)}}\left( {\overset{}{x}(t)} \right)}\quad \propto {\sum\limits_{w = 1}^{N_{w}}{\frac{\partial{y_{q}(t)}}{\partial{x_{p}(t)}}{\overset{}{x}(t)}{\prod\limits_{i = 1}^{n}\quad {P\left( {y^{(i)}\left. {x^{(i)},\overset{}{\omega}} \right){P\left( \overset{}{\omega} \right)}} \right.}}}}} & (0010)\end{matrix}$

[0050] In this relationship the values for derivatives are averaged overthe models. To reduce computation time, it may be desirable to estimateEquation (10) instead of computing it fully. The best single-termestimate would be the one with the largest posterior (or probabilityweighting factor) for weighting the gains:

Π_(i=1) ^(n) P(y ^((i)) |x ^((i)), {right arrow over (ω)})P({right arrowover (ω)})  (0011)

[0051] In Bayesian terminology, any such estimate is called the MAP(maximum a posteriori) estimate. In order for this MAP estimate tosignificantly reduce computing time, it would be necessary to haveaccess to the ensemble of models already sorted in posterior magnitudeorder: a sorted index to the models at the completion of the trainingprocedure could quickly and easily be created. Since this would be doneonly once, the required computing time would be insignificant.

[0052] Referring now to FIG. 6, there is illustrated a block diagramdepicting the operation illustrated in Equation (10), the models 200each having outputs thereof input to a weighted average block to providethe weighted average output <y(t)>. In order to provide the derivatives,the values on the output of each of the models 200 must beback-propagated through the model to determine the derivative thereof,or any other technique that Will provide that derivative. Thebackpropagated method is basically a recursive approach. This isrepresented by a derivative block 600 for each of the models 200. Thederivative block 600 is operable to receive the x(t) value and theoutput of the associated model 200 y^((w))(t). The output of thederivative block 600 is ∂y^((w))(t)/∂x(t). Each of the derivativesoutput by each of the blocks 600 are input a weighting block 604 whichis operable to provide a weight to each of the derivatives which arethen summed in a summing block 606. This provides the weighted averageof the derivative ∂<y(t)>/∂x(t) over all of the models.

[0053] This basic idea of estimating using the single MAP model can beiterated to improve the estimation to the desired level of accuracy. Thesecond level estimate would consist of taking the model with the nexthighest posterior (next model in the indexed list) and averaging it withthe first (the MAP) model, to yield a two-model average. This processcould be iterated, incrementally improving the estimate, with somestopping criterion defined to halt the procedure. A stopping criterionmight be to halt when the change in the estimate due to adding the nextmodel is less than some threshold. The extreme of this process is ofcourse the full sum of Equation (10).

[0054] The above discussion of Equation (10) or its estimation, involvedtaking the derivative ∂y_(q) ^((w))(t)/∂x_(p)(t) evaluated at a givenvector point x(t), and computing their (posterior weighted) average overthe ensemble of models. Sensitivity analysis examines statistics overthe dataset of these ensemble-averaged derivatives. Consider, forinstance, taking the absolute value of the ensemble-averagedderivatives, and averaging them over the dataset: this information wouldindicate the overall relative strengths, over the historical operatingconditions represented by the dataset, of the effect of each inputvariable x_(p)(t) on a given output variable y_(q)(t). Thisdouble-average derivative could be written: $\begin{matrix}{{{\langle{\frac{\partial{\langle{y_{q}(t)}\rangle}}{\partial{x_{p}(t)}}}\rangle}D} \propto {\sum\limits_{j}{\sum\limits_{\omega}{{{\frac{\partial{\langle{y_{q}^{(\omega)}(t)}\rangle}}{\partial{x_{p}(t)}}\left( {{\overset{}{x}}^{(j)}(t)} \right)}}{\prod\limits_{i = 1}^{n}\quad {P\left( {{{\overset{}{y}}^{(i)}(t)}\left. {{{\overset{}{x}}^{(i)}(t)},\overset{}{\omega}} \right){P\left( \overset{}{\omega} \right)}} \right.}}}}}} & (0012)\end{matrix}$

[0055] where <>D indicates the average over the dataset of vector pointsx^(j)(t).

[0056] In addition, statistics over the dataset other than the averagecan often yield useful information, such as the median, the standarddeviation, and so forth.

[0057] Optimization/Control

[0058] Process optimization ordinarily refers to determining the optimalinput vector {circumflex over (x)}(t) that will minimize a definedobjective function J while satisfying any defined constraint functionsC_(m). J is ordinarily a function of the process model and itsvariables, which expresses the desired characteristics of processoperation, output product characteristics, and so forth. The C_(m)functions are more often (though not always) a function only of theinput variables, which express relationships among the process variableswhich must hold for physical or desired operational reasons; forexample, a mass-balance constraint might dictate that x1=x2+x3. A validsolution of a constrained optimization problem always satisfies theC_(m) relationship, and minimizes J as well as possible within thoseconstraints.

[0059] “Optimization” typically means “steady-state optimization”(finding an optimal point in operating space using a steady-statemodel), while “control” typically means “dynamic control” (finding anoptimal trajectory in operating space using a dynamic model). Both are“optimization problems.”

[0060] In optimization or control, an optimization algorithm uses theprocess model to find the optimal {circumflex over (x)}(t), given theobjective J and constraint C_(m) functions. Neural network models are ingeneral nonlinear, so nonlinear optimization algorithms are used.Unconstrained or constrained optimization is performed depending uponwhether or not any constraint functions are defined. Mathematically,unconstrained and constrained nonlinear optimizations are verydifferent, and different optimization algorithms are used. Henceforththe general (and most typical in industrial processes) case ofconstrained nonlinear optimization will be assumed.

[0061] Nonlinear constrained optimization algorithms that make use ofderivatives generally execute much faster than those that do not. Avariety of such nonlinear constrained optimization programs arecommercially available. The most popular codes are based on theSequential Quadratic Programming (SQP) or the Generalized ReducedGradient (GRG) methods.

[0062] A prototypical objective function isJ=Σ_(k)(<y^((n+k))>−y^((n+k)))², i.e., the sum over all non-trainingdatapoints of the squared difference between a desired (setpoint) outputvalue y^((n+k))(t) and the Bayesian model output <y^((n+k))(t)>. A moregeneral example of an objective function is one representing the(negative, as the objective function is to be minimized) profit of theprocess (by associating prices and costs with the input and outputvariables). One possibility would be to then use the resulting outputvariable values as setpoints for those output variables.

[0063] In order to optimize the objective function J with respect to theinput variable x(t), it is necessary to determine ∂J/∂x_(p)(t).Utilizing simple rules for derivatives, it is relatively easy todetermine the optimized value for the Bayesian model output <y(t)> bythe relationship ∂J/∂<y(t)>. However, utilizing the multiple models, theweighted average must be factored into the calculation in order todetermine the optimization with respect to the optimization of theobjective function J with respect to the input x(t). In order todetermine to this, the chain rule is utilized. By the chain rule, thefollowing holds: $\begin{matrix}{{\frac{\partial J}{\partial{x_{p}(t)}}}_{p} = {\sum\limits_{q}{\frac{\partial J}{\partial{\langle{y_{p}(t)}\rangle}}\frac{\partial{\langle{y_{p}(t)}\rangle}}{\partial{x_{p}(t)}}}}} & (0013)\end{matrix}$

[0064] because the output variables <y_(q)(t)> are referencedexplicitly, the first factor ∂J/∂<y_(q)(t)> is computable by theelementary rules of differentiation for each input x_(p)(t). The secondfactor, however, involves the Bayesian ensemble of models, which relate<y_(q)(t)> to x_(p)(t), thus representing the sum over all models.

[0065] Therefore, for purposes of (derivative-based) processoptimization or control using Bayesian modeling, the (q, p) derivativematrix ∂y_(q) ^((w))(t)/∂x_(p)(t) from Equation (10), one for each modelF^((w))x(t) in the ensemble, is the fundamental quantity required. Fromthese matrices, the values ∂<y_(q)(t)>/∂x_(p)(t), ∂J/∂x_(p)(t) can bedetermined, and, if the y_(q)(t) values are referenced in any C_(m)functions, ∂C_(m)/∂x_(p)(t) can be computed by the relationship:$\begin{matrix}{{\frac{\partial C_{m}}{\partial{x_{p}(t)}}}_{p} = {\sum\limits_{q}{\frac{\partial C_{m}}{\partial{\langle{y_{p}(t)}\rangle}}\frac{\partial{\langle{y_{p}(t)}\rangle}}{\partial{x_{p}(t)}}}}} & (0014)\end{matrix}$

[0066] In each method, any nonlinear constrained optimization code, suchas an SQP or GRG code, may be used to perform the optimization. Any suchcode searches the x-space for the x-value that will minimize the given Jwhile satisfying the given C_(m) by iteratively passing to user-suppliedsubroutines an x(t) value (in general, different at each iteration) andreceiving back the objective and constraint functions, and thederivatives, all evaluated at x(t). The derivatives may of course becomputed in full or may be estimated to any degree of accuracy, asdescribed hereinabove.

[0067] There are at least two fundamentally different ways thatoptimization over a Bayesian ensemble of models may be carried out.Roughly speaking, method (1) performs a single optimization over theentire ensemble, and method (2) performs multiple optimizations, one foreach model in the ensemble, and when finished combines the results.

[0068] Optimization Method (1):

[0069] In this method, the optimization routine performs a singleoptimization over all of the models in the ensemble and returns a singleoptimal value for {circumflex over (x)}(t). When the optimizer requeststhe values of the functions and derivatives evaluated at a point x(t), auser-supplied subroutine must compute the derivative values∂<y_(q)(t)>/∂x_(p)(t) by applying the chain rule of Equation (13) to theentire ensemble of models. After some number of iterations of suchsubroutine calls, the single optimization procedure terminates with anoptimal {circumflex over (x)}(t).

[0070] Referring now to FIG. 7, there is illustrated a block diagram ofthe first optimization method. In this method, the models 200 areprovided, each for receiving the input value x(t) and outputting theoutput value y(t). There are provided N_(w) models F^((w))(x(t)). Theoutput of each of the models 200 is input to the block 600 which isoperable to determine the derivative on the output thereof. Each of thederivatives from each of the blocks 600 for each of the models 200 areinput to a weighted average block 702 which is operable to provide theweighted average of the derivative as set forth in Equation (10). Thisis then subjected to the chain rule of Equation (13) via a block 706 toprovide the single optimal value for {circumflex over (x)}(t). Note thatthis is performed at each value of x(t), as noted hereinabove withrespect to Equation (13). Also, it is noted that the weighted averageoperation is performed prior to applying the chain rule. To summarizemethod (1), the probability distribution P(ω|D) is provided for a singlemodel with the other probability distribution then sampled utilizing theoptimizer.

[0071] Optimization Method (2):

[0072] In this method, each model in the Bayesian ensemble is optimizedseparately, yielding an optimal {circumflex over (x)}^((w))(t) for eachmodel F^((w))x(t). During the optimization of each model, anoptimization process for a model F^((w))x(t) requests function andderivative values at a point x_(p)(t). The functions and derivativevalues returned are for that single model only, that is,∂y_(q)(t)^((w))/∂x_(p)(t). The optimization of each model 200 terminateswith an optimal value {circumflex over (x)}^((w))(t) for that model,such that there are N_(w) such optimized values. At the completion ofall optimizations, some operation, such as a weighted average, isperformed over the set of {circumflex over (x)}^((w))(t) values tooutput a single optimal {circumflex over (x)}(t), e.g.: $\begin{matrix}{{\langle{\hat{x}\left( t\rangle \right.}\rangle} \propto {\sum\limits_{w = 1}^{N_{w}}\quad {{{\hat{x}}^{(w)}(t)}{\prod\limits_{i = 1}^{n}\quad {P\left( {{\overset{}{y}}^{(i)}\left. {{\overset{}{x}}^{(i)},\overset{}{\omega}} \right){P\left( \overset{}{\omega} \right)}} \right.}}}}} & (15)\end{matrix}$

[0073] In addition, the distribution of {circumflex over (x)}(t) valuesmay hold useful information for process operation in addition to thesingle averages. It should be understood that combinations of these twofundamental optimization methods are to be considered and that thedisclosed embodiment is not exhaustive.

[0074] Referring now to FIG. 8, there is illustrated a block diagramdepicting the second method of optimization. In this diagrammatic view,the models 200 and derivative blocks 600 are provided for generating thederivatives for each of the associated models 200. The output of each ofthe derivative blocks 600 is, as distinguished from the embodiment ofFIG. 4, input to a block 800, wherein that value is optimized over theoutput such that the optimized value ∂J/∂y^((w))(t) provides the value{circumflex over (x)}^((w))(t) for each model. These are then processedthrough a weighted average block 804, which is operable to implementEquation (11) to provide a single optimal {circumflex over (x)}(t). Tosummarize this optimization method, the probability distributionP({circumflex over (x)}, {right arrow over (ω)}|D, J, C) provides theprobability distribution over the {circumflex over (x)}(t), given D, Jand possibly C.

[0075] Referring now to FIG. 8a, there is illustrated a simplified blockdiagram of the embodiment of FIG. 8, wherein a single block 810represents the combination of the models 200, derivative block 600 andmultiplication blocks 800. This, as noted hereinabove, is operable toprovide a plurality of estimated values for each model 200 in the formof {circumflex over (x)}⁽¹⁾(t), wherein there are w estimated values forthe input, one for each model. These are all input to a block 812 whichblock is operable to provide some type of selection or averagingoperation, this being similar to the block 804 which performsspecifically a weighted average. However, it should be understood thatthe selection operation in block 812 can utilize any criteria. Forexample, it could be a weighted average as described hereinabove, or itcould be some type of selection criteria that selected the best singlevalue. FIG. 8b illustrates a plot of the estimated input values as afunction the index value where it can be seen that there is a Gaussiandistribution of such values. The important aspect of this block 804 isthat any type of selection criteria can be utilized to provide somevalue that is a function of all of the estimated values or someselection criteria that eliminates certain ones of the values andselects other values for an averaging type operation. In such a manner,all of the estimated values need not be utilized for the selectioncriteria.

[0076] Dynamic Models

[0077] The above discussion has been described with respect tosteady-state process models. The indices k described hereinabovedescribe new data (n+k), whereas a dynamic system utilizes the index kto represent time intervals, which need not represent equally spacedtime intervals. In general, a trajectory of output values {y(t+1) . . .y(t+k_(max))}(y_(x)(t)) is predicted from a current y(t) for each new“setpoint” y(t) along with the a corresponding trajectory of controlinput values (corresponding to the x(t) inputs for the steady stateprocess) necessary to provide such y_(x)(t), which trajectory takes apredetermined number of the time intervals to achieve. The trajectory ofcontrol input values is defined for k=1 to k=k_(max) as {u(t+1) . . .u(t+k_(max))}(u_(k)(t)). The system will then make a move along thistrajectory for a first interval of time and even additional intervals oftimes by predicting the new u(t) values necessary to make such a movebefore the next dynamic prediction is made. This dynamic predictioncould be made at each move interval such that a new trajectory for thesetpoint is predicted from a new control input value u(t). As such, thenew “setpoint” optimization could be performed at each time interval.

[0078] When using the above-described optimization/control proceduresfor dynamic models, which are iterated out to the control horizon intime, the optimization is performed over the entire trajectory (timeinterval (t+1, t+k_(max)), where (t+1) represents the current timeinterval). The first step in the optimal trajectory is taken by theplant, and the whole optimization is begun again from that point. Thisis the generic model predictive control picture. The optimization overthe trajectory is subject to an objective function and a variety ofconstraints, such as “hard” constraints on the trajectory which are wideat (t+1) and converge (with some specified tolerance) to the outputsetpoint at (t+k_(max)). In concept, the control optimization result isthe optimal trajectory specified by values at each time increment.However, most model predictive control packages, including thatdescribed in U.S. patent application Ser. No. 08/643,464, filed May 6,1996, and entitled “Method and Apparatus for Modeling Dynamic andSteady-State Processes for Prediction, Control and Optimization,”incorporated by reference herein, use “blocking” to speed up theoptimization, i.e., the control horizon is divided into t intervals,usually spaced closer near (t+1) and wider near (t+k_(max)) (logarithmicspacing). The number of independent optimization variables is thus onlym times the number of control input variables, which shortenscomputation time dramatically with (almost always) very littledifference in the first increment of the optimal trajectory, and makesit computationally tractable. The same choice between the twofundamentally different optimization methods described above apply justas clearly in the dynamic case.

[0079] Referring now to FIG. 9, there is illustrated a diagrammatic viewof the trajectory of y_(k)(t) determined in the optimization process forcontrol. In this process, the time horizon is divided up into mincrements or intervals, these being equal time segments or unequal timesegments. It can be seen that the dynamic prediction for the value ofy_(k)(t) is made along the increments from the current position in time(t+1) and extending out to the horizon at the value of (t+k). This finaly^((n+k))(t) value is often equal (within some tolerances) to the outputsetpoint, provided the optimal change to u(t) was made.

[0080] Referring now to FIG. 10, there is illustrated a block diagramfor the plant 10 utilizing a controller network 1002 in feedback. Thisis very similar to the optimizer 16 of FIG. 1, with the exception thatit predicts a trajectory utilizing cost function J, constraints C_(m),and various setpoints. The setpoints are typically in the form of adesired move in the output vector y(t). The controller network 1002 willreceive as inputs select ones of the output y(t), and project thetrajectory out over the control horizon in time and the u(t) valuesnecessary to achieve such trajectory. This will typically be predictedin time intervals such that a dynamic plant controller 1004 will be ableto generate new input values u(t).

[0081] In utilizing dynamic models, the models can be of differingtypes. In the disclosed embodiment, the dynamic model is a linear modelwhich is defined by the following relationship:

{right arrow over (y)}_(k) =G({right arrow over (y)} _(k−1) , {rightarrow over (y)} _(k−2) , {right arrow over (u)} _(k−1−d) , {right arrowover (u)} _(k−2−d) a, b)  (0016)

[0082] where:

[0083] y_(k−1), y_(k−2) are outputs at different points in time;

[0084] u_(k−1−d), u_(k−2−d) are input values at different points intime;

[0085] d is a delay value; and

[0086] a, b are the parameters of the linear model.

[0087] One example of the linear model is set by the followingrelationship:

{right arrow over (y)}_(k) =−a ₁{right arrow over (y)}_(k−1) −a ₂{rightarrow over (y)}_(k−2) +b ₁{right arrow over (u)}_(k−1−d) +b ₂{rightarrow over (u)}_(k−2−d)  (0017)

[0088] Although Equation (17) is set forth as a linear equation with alinear model, additional non-linear terms can be attached to thisequation to result in a non-linear model. However, the parameters ofthis model are set by the a's and b's, i.e., the parameter values of themodel. This also pertains to the gain, this described in detail in U.S.patent application Ser. No. 08/643,464, which was incorporated byreference hereinabove.

[0089] When identifying the stochastically-related model via the varioustechniques described hereinabove, the disclosed one being the Bayesiantechnique, the models are trained in substantially the same way as thenon-linear and neural networks, described with respect to thesteady-state process hereinabove. This will yield w models which arestochastically-related by the following relationship:

{right arrow over (y)} _(k) ^((w)) =−a ₁ ^((w)) {right arrow over (y)}_(k−1) ^((w)) −a ₂ ^((w)) {right arrow over (y)} _(k−2) ^((w)) +b ₁^((w)) u _(k−1−d) +b ₂ ^((w)) u _(k−2−d)  (0018)

[0090] This will provide the models y_(k)=G^(w) such that there are w ofthe stochastically-related dynamic models. As was set forth hereinabove,the first step is to predict the output value from each of the modelsgiven the measured dataset D and the set of model functions G^((w)),each described by a set of parameters a, b. The average predicted outputis then given by: $\begin{matrix}{{\langle{\overset{}{y}}_{k}^{n}\rangle} = {\sum\limits_{w = 1}^{N_{w}}\quad {{G^{(w)}\left( {\overset{}{u}}^{n} \right)}{P\left( {a,{b\left. D \right)}} \right.}}}} & (0019)\end{matrix}$

[0091] where P(a, b|D) is a conditional probability for the modelG^((w))with parameters a and b, given the dataset D. With the well-knownBayesian relation as described hereinabove, and the fact the P(D) isconstant, the following relationship is obtained: $\begin{matrix}{{\langle{\overset{}{y}}_{k}^{n}\rangle} \propto {\sum\limits_{w = 1}^{N_{w}}\quad {{G^{(w)}\left( {\overset{}{u}}^{n} \right)}{\prod\limits_{i - 1}^{n}\quad {P\left( {{\overset{}{y}}^{(i)}\left. {{\overset{}{x}}^{(i)},a,b} \right){P\left( {a,b} \right)}} \right.}}}}} & (0020)\end{matrix}$

[0092] where$\left. {{\prod\limits_{i - 1}^{n}\quad {{P\left( {\overset{}{y}}^{(i)} \right.}{\overset{}{x}}^{(i)}}},a,b} \right)$

[0093] is the likelihood. P(a, b) is the prior distribution of theparameters (a, b) of the model, and their product is the posteriordistribution, as was described hereinabove with respect to thesteady-state case. All of the above-noted equations apply to the dynamiccase. The only difference is that the input is now u(t) and theparameters of the model are (a, b), as compared to ω.

[0094] In order to perform a sensitivity analysis or to perform anoptimization and/or control, each of the N_(w) ensembles of G^((w))models must have the derivative thereof determined by the followingrelationship: $\begin{matrix}\frac{\partial{y_{q,k}^{(w)}(t)}}{\partial{u_{p}(t)}} & (21)\end{matrix}$

[0095] As noted hereinabove, it is then necessary to determine theaverage derivatives, weighted over the ensemble of models by thefollowing relationship similar to Equation (21): $\begin{matrix}{\left. {{{\frac{\partial{\langle{y_{q,k}(t)}\rangle}}{\partial{u_{p}(t)}}\left( {\overset{}{u}(t)} \right)} \propto {\sum\limits_{w = 1}^{N_{w}}\quad {\frac{\partial{y_{q,k}(t)}}{\partial{u_{p}(t)}}\left( {\overset{}{u}(t)} \right){\prod\limits_{i = 1}^{n}\quad {{P\left( {\overset{}{y}}^{(i)} \right.}{\overset{}{x}}^{(i)}}}}}},a,b} \right){P\left( {a,b} \right)}} & (0022)\end{matrix}$

[0096] Referring now to FIG. 11, there is illustrated a block diagramdepicting the operation illustrated in Equation (22) for a dynamic modelto determine the average derivative. This basically parallels theoperation of the embodiment in the FIG. 6, described hereinabove withrespect to steady-state models. There are provided a plurality ofdynamic models 1100 corresponding to the static models 200 describedhereinabove which are operable to provide a predicted value y^((w))(t).The output of each of these models 1100 can be input to a weightedaverage block 1102 to provide the value <y(t)>. However, in order toprovide the prediction for the average derivative ∂<y(t)>/∂u(t), it isnecessary to optimize the derivative as described hereinabove in theoptimization method (1). This requires feeding both the input value u(t)and the output y^((w))(t) for each model 1100 to a derivative block1104. This provides the derivative for each value of w∂y^((w))(t)/∂u(t).These derivatives for each of the models are then input to an averagingblock 1106 to take the weighted average thereof, and then to a summingblock 1108 to provide the average derivative ∂<(y(t)>/∂u(t).

[0097] Once the average derivative is determined for the dynamic model,then this can be optimized, utilizing the optimization method (1) or theoptimization method (2) described hereinabove, except that a dynamicmodel is used. This is illustrated in FIG. 12 which parallels FIG. 7 forthe static model. The derivatives of each of the models output from thederivative block 1104 are first subjected to the weighted average in ablock 1200, representing block 1106 and 1108, and then the average ofthe derivatives is then multiplied by the optimization objective J_(D)for the dynamic model condition. This basically performs the chain ruleby multiplying the derivative of J_(D) with respect to the averageoutput, ∂J_(D)/∂<y(t)> to provide on the output the derivative of theoptimization objection J_(D) with respect to u(t), ∂J_(D)/∂<u(t)>. Thisessentially is the optimization method one for the dynamic case.

[0098] In optimization method (2), the dynamic model representation isillustrated in FIG. 13, which parallels FIG. 8. Each of the models 1100has the derivative thereof determined by the derivative block 1104.However, rather than take the weighted average of the derivatives, eachû^((w))(t+k_(min)) . . . û^((w))(t+k_(max))(û_(k) ^((w))(t)) of themodels has the derivative thereof optimized to provide an estimatedcontrol trajectory û^((w))(t) for each value of w. These are theestimated values of the inputs for each model which are then processedthrough a weighted average block 1304 which, as described hereinabove,is operable to perform some type of algorithm or selection processthereon. This can be any type of selection process for the model. Itcould be an averaging operation, or it could be a selection process.This provides a single control trajectory û(t) value. Given the controltrajectory û(t), the first value thereof is input to the plant.

[0099] In U.S. patent application Ser. No. 08/643,464, incorporatedherein by reference, there was disclosed a technique for defining thegain of dynamic models as a function of the gain of the steady-stateneural network model. The gain of the steady-state model is referred toby the term “K_(ss),” and the gain of the dynamic model is referred toas “k_(d).” The relationship for the dynamic gain, k_(d), is defined interms of the parameters (a, b) of the dynamic model as follows:$\begin{matrix}{k_{d} = \frac{\sum\limits_{i = 1}^{n}\quad b_{i}}{1 + {\sum\limits_{i = 1}^{n}\quad a_{i}}}} & (0023)\end{matrix}$

[0100] Since the gain K_(ss) of the steady-state model is known, thedynamic gain k_(d) of the dynamic model can then be forced to match thegain of the steady-state model by scaling the b_(i) parameters. Thevalues of the static and dynamic gains are set equal with a value ofb_(i) scaled by the ratio of the two gains as follows: $\begin{matrix}{\left( b_{i} \right)_{seated} = {\left( b_{i} \right)_{old}\left( \frac{K_{ss}}{k_{d}} \right)}} & (0024) \\{\left( b_{i} \right)_{seated} = \frac{\left( b_{i} \right)_{old}{K_{ss}\left( {1 + {\sum\limits_{i = 1}^{n}\quad a_{i}}} \right)}}{\sum\limits_{i - 1}^{n}\quad b_{i}}} & (0025)\end{matrix}$

[0101] This makes a dynamic model consistent with its steady-statecounterpart, as described in U.S. patent application Ser. No.08/643,464, which was incorporated by reference hereinabove. Therefore,each time the steady-state value changes such that the operating regionof the steady-state model is different, this will correspond to apotentially different gain K_(ss) for the steady-state model. This valuecan then be utilized to update the gain k_(d) of the dynamic model and,therefore, compensate for the errors associated with a dynamic model,wherein the value of k_(d) is determined based on perturbations in theplant on a given set of operating conditions. Since all operatingconditions are not modeled, the step of varying the gain will accountfor changes in the steady-state starting points. With respect to thepresent application utilizing stochastically related models, it isnecessary to determine the dynamic gain of each of the dynamic models1100.

[0102] Referring now to FIG. 14, there is illustrated a block diagram ofthe optimizer 16 utilizing a steady-state model to fix the gain of thedynamic model 1100. The optimizer, as described hereinabove, is dividedinto a plurality of blocks for determining the derivative of∂y^((w))(t)/∂x(t). These blocks are referred to with the referencenumeral 1400. There are provided “w” of these blocks 1400, the output ofeach of these inputs input to a product block 1405, the outputs thereofsummed in a summing block 1406 to provide on the output thereof theaverage or weighted derivative ∂<y(t)>/∂x(t).

[0103] Each of the blocks 1400 has associated therewith the dynamicmodel 1100 with the input u(t) and the output y(t) input to thederivative block 1104. Additionally, a steady-state model 200 isprovided for each of the dynamic models 1100 in each of the blocks 1400.Therefore, each of the models 200 is a stochastically related modelF^((w))x(t) which has an associated steady-state gain K^((w)) _(ss).This is a known gain for that model. In the embodiment illustrated, thesteady-state model 200 for w=1 is associated with the G⁽¹⁾u(t) model1100. A gain modulator 1410 is provided for determining the dynamic gaink⁽¹⁾ _(d). In the preferred embodiment, as set forth in U.S. patentapplication Ser. No. 08/643,646, the dynamic gain is forced to be equalto the steady-state gain. This provides the dynamic gain k⁽¹⁾ _(d) forthe value of w=1.

[0104] The b_(i) of each of the models 1100 would be defined with theindex b^((w)) _(i). Therefore, when these are scaled with the gainmodulator 1410 they would be scaled with the following relationship:$\begin{matrix}{\left( b_{i}^{(w)} \right)_{seated}\frac{\left( b_{i}^{(w)} \right)_{old}{K_{ss}^{(w)}\left( {1 + {\sum\limits_{i = 1}^{n}\quad a_{i}^{(w)}}} \right)}}{\sum\limits_{i = 1}^{n}\quad b_{i}^{(w)}}} & (0026)\end{matrix}$

[0105] Although the index for the steady-state model 200 was set equalto the index for the dynamic model 1100, it should be understood that,even though there are multiple ones of the progressive steady-statemodels 200 and multiple ones of the progressive dynamic models 1100, itis not necessary to match the indices. For example, it could be that theindex w=1 could be matched to the maximum value for the index on thedynamic model 1100, such that F⁽⁵⁰⁰⁾x(t) is matched to the dynamic model1100 with the minimum index G⁽¹⁾u(t), wherein the maximum index isw=500.

[0106] In an alternate embodiment, as illustrated in FIG. 15, the gainof a single steady-state model 1500 is utilized to provide a singlesteady-state gain K_(ss) for all of the gain modulation modules 1410,such that only one steady-state model is required. This is not aprogressive model. Alternatively, the dynamic model for each of theblocks 1400 could be the same with the steady-state models 200 beingprogressive stochastically-related models. Therefore, there would be aplurality of blocks 1502 which contained only the dynamic model 1100,the derivative block 1104 and the gain modulator 1410. The steady-stategain K_(ss) of the model 1500 would be input to each of the blocks 1502.

[0107] In summary, there has been provided a method and apparatus bywhich a stochastical method is utilized for optimizing y(t) with respectto x(t) through the use of averaging over multiple regression modelsF^((w)). This optimization is utilized to provide a single optimalvector for the values of x(t) which constitute inputs to a plant.

[0108] Although the preferred embodiment has been described in detail,it should be understood that various changes, substitutions andalterations can be made therein without departing from the spirit andscope of the invention as defined by the appended claims.

What is claimed is:
 1. A method for determining the optimum operation ofa system, comprising the steps of: receiving the outputs of the systemand the measurable inputs to the system; and optimizing select ones ofthe outputs as a function of the inputs by minimizing an objectivefunction J to provide optimal values for select ones of the inputs;wherein the step of optimizing includes the step of predicting theselect ones of the outputs with a plurality of models of the system,each model operable to map the inputs through a representation of thesystem to provide predicted outputs corresponding to the select ones ofthe outputs which predicted outputs of each of the plurality of modelsare combined in accordance with a predetermined combination algorithm toprovide a single output corresponding to each of the select ones of theoutputs.
 2. The method of claim 1, wherein the optimal value of theoutputs of the plurality of models is determined as a single averagedoptimal output value for each of the select ones of the outputs.
 3. Themethod of claim 1, and further comprising the step of applying theoptimal values of the select ones of the inputs to the correspondinginputs of the system after determination thereof.
 4. The method of claim1, wherein the step of receiving the outputs of the system comprisesreceiving measurable outputs of the system.
 5. The method of claim 1,wherein the step of optimizing comprises a derivative-based optimizationoperation.
 6. The method of claim 5, wherein the step of optimizingcomprises the steps of: determining the average predicted output of theplurality of models <y(t)>; determining the average derivative of theaverage predicted output <y(t)> with regards to the inputs x(t) as∂<y(t)>/∂x(t); the objective function J being a function of <y(t)> anddetermining a derivative of the objective function J with respect to<y(t)> as ∂J/∂<y(t)>; determining with the chain rule the relationship∂J/∂x(t); and determining the minimum of the J.
 7. The method of claim5, wherein the average derivative of the average predicted output isweighted over the plurality of models.
 8. The method of claim 2, whereinthe step of predicting the select ones of the outputs with the pluralityof models of the system comprises predicting the output to a pointforward in time as a trajectory.
 9. A method for optimizing theparameters of a system having a vector input x(t) and a vector outputy(t), comprising the steps of: storing a representation of the system ina plurality of models, each model operable to map the inputs through arepresentation of the system to provide a predicted output, each of themodels operable to predict the output of the system for a given inputvalue of x(t), each model operable to map the inputs through arepresentation of the system to provide a predicted output; providingpredetermined optimization objectives; and determining a singleoptimized input vector value {circumflex over (x)}(t) by applying apredetermined optimization algorithm to the plurality of models toachieve a minimum error to the predetermined optimization objective. 10.The method of claim 9, wherein the step of determining comprisesdetermining the derivative ∂y(t)/∂x(t) of each of the models and thendetermining an average of the derivatives ∂y(t)/∂x(t).
 11. The method ofclaim 10, wherein the step of determining the average of the derivativecomprises determining the weighted average of the derivatives∂y(t)/∂x(t).
 12. The method of claim 11, wherein the step of determiningthe average derivative is defined by the following relationship:${\frac{\partial{\langle{\overset{}{y}}_{q}\rangle}}{\partial{{\overset{}{x}}_{p}(t)}}\left( {\overset{}{x}(t)} \right)} \propto {\sum\limits_{w = 1}^{N_{w}}{\frac{\partial{{\overset{}{y}}_{q}(t)}}{\partial{{\overset{}{x}}_{p}(t)}}{\overset{}{x}(t)}{\prod\limits_{i = 1}^{n}\quad {P\left( {{\overset{}{y}}^{(i)}\left. {{\overset{}{x}}^{(i)},\overset{}{\omega}} \right){P\left( \overset{}{\omega} \right)}} \right.}}}}$

where$\prod\limits_{i = 1}^{n}\quad {P\left( {{\overset{}{y}}^{(i)}\left. {{\overset{}{x}}^{(i)},\overset{}{\omega}} \right)} \right.}$

is the likelihood, P(ω) is a prior distribution of the parameters ω ofthe model, and their product is the posterior distribution.
 13. Themethod of claim 9, wherein the step of storing a representation of thesystem in a plurality of models comprises storing a representation ofthe system in a plurality of non-linear or linear networks, eachoperable to map the input x(t) to a predicted output through a storedrepresentation of the system.
 14. The method of claim 13, wherein thestored representation of the system in each of the plurality ofnon-linear or linear networks are related in such a manner wherein theparameters of each of the linear or non-linear networks arestochastically related to each other.
 15. The method of claim 14,wherein the stochastic relationship is a Bayesian relationship.
 16. Themethod of claim 9, wherein the predetermined optimization algorithm isan iterative optimization algorithm.
 17. The method of claim 9, whereinthe step of determining the single optimized input vector value{circumflex over (x)}(t) comprises determining the derivative of thepredetermined optimization objective relative to the input vector x(t)as ∂J/∂x(t), where J represents the predetermined optimizationobjective.
 18. The method of claim 9, wherein the step of determiningcomprises determining the derivative ∂y(t)/∂x(t) of each of the modelsand then determining an average of the derivatives ∂y(t)/∂x(t).
 19. Themethod of claim 18, wherein the step of determining the averagederivative is defined over a (q, p) matrix by the followingrelationship:${\frac{\partial{\langle{\overset{}{y}}_{q}\rangle}}{\partial{{\overset{}{x}}_{p}(t)}}\left( {\overset{}{x}(t)} \right)} \propto {\sum\limits_{w = 1}^{N_{w}}\quad {\frac{\partial{{\overset{}{y}}_{q}(t)}}{\partial{{\overset{\rightarrow}{x}}_{p}(t)}}{\overset{}{x}(t)}{\prod\limits_{i = 1}^{n}\quad {P\left( {y^{(i)}\left. {x^{(i)},\omega} \right){P(\omega)}} \right.}}}}$

where$\prod\limits_{i = 1}^{n}\quad {P\left( {y^{(i)}\left. {x^{(i)},\omega} \right)} \right.}$

is the likelihood, P(ω) is a prior distribution of the parameters ω ofthe model, and their product is the posterior distribution.
 20. Themethod of claim 19, wherein the step of determining ∂J/∂<x(t)> comprisesthe steps of: determining the weighted average of the predicted outputof each of the models by the following relationship:${\langle{\overset{}{y}(t)}\rangle} \propto {\sum\limits_{w = 1}^{N_{w}}{{F^{(w)}\left( \overset{}{x} \right)}{\prod\limits_{i = 1}^{n}\quad {P\left( {y^{(i)}\left. {x^{(i)},\omega} \right){P(\omega)}} \right.}}}}$

where P(y^((i))|x^((i)), ω)P(ω) represents the posterior probability ofthe model indexed by w, and N_(w) represents the maximum number ofmodels in the stochastic relationship, and wherein the storedrepresentation of the system in each of the plurality of models arerelated in such a manner wherein the parameters of each of the modelsare stochastically related to each other; determining the derivatives∂J/∂<y(t)> as the variation of the predetermined optimization objectivewith respect to the predicted output y(t); and determining by the chainrule the following:${\frac{\partial J}{\partial{{\overset{}{x}}_{p}(t)}}}_{p} = {\sum\limits_{q}{\frac{\partial J}{\partial{\langle{{\overset{}{y}}_{q}(t)}\rangle}}{\frac{\partial{\langle{{\overset{}{y}}_{q}(t)}\rangle}}{\partial{\overset{}{x}(t)}_{p}}.}}}$